Creates a new way to test multiple selections for each protein; parallelizes LDDT script #116
marcuscollins merged 6 commits into main
Conversation
Merge conflicts resolved in:
- src/sampleworks/core/rewards/real_space_density.py
- src/sampleworks/utils/atom_array_utils.py
Parallelizing LDDT script with joblib. Closes #98
Caution: Review failed. The pull request is closed.

ℹ️ Recent review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
📝 Walkthrough

Parallelizes LDDT evaluation by splitting per-protein selections into independent jobs, expands reference loading and selection handling to be per-selection, updates dataclasses and selection utilities to support multiple selections, and aggregates per-selection results into a unified dataframe.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI / Main Script
    participant Filter as Experiment Filter
    participant RefLoader as Reference Loader
    participant Parallel as joblib.Parallel
    participant Worker as process_exp_with_selection
    participant Aggregator as Result Aggregator
    CLI->>Filter: load experiments & apply case-insensitive protein matching
    Filter->>RefLoader: request reference coords per (protein, occ, selection)
    RefLoader-->>Filter: return nested dict [(protein,occ)][selection]->coords
    CLI->>Parallel: dispatch (exp, selection) tasks
    Parallel->>Worker: for each (exp,selection) -> load predicted structure
    Worker->>Worker: apply selection mask / clustering / compute LDDT
    Worker-->>Aggregator: return per-selection result dict (metrics or error)
    Aggregator->>Aggregator: combine results + null_results -> dataframe
    Aggregator-->>CLI: write CSV / return dataframe
```
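The walkthrough and diagram describe one independent job per (experiment, selection) pair. The script itself uses joblib.Parallel; the same fan-out shape can be sketched with the standard library (the worker name and result fields below are placeholders, not the script's actual API):

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def process_exp_with_selection(exp: str, selection: str) -> dict:
    # Placeholder worker: in the real script this loads the predicted
    # structure, applies the selection mask, and computes LDDT.
    return {"exp": exp, "selection": selection, "lddt": 0.0}

experiments = ["exp1", "exp2"]
selections = ["name CA", "resi 10-20"]

# Dispatch one task per (experiment, selection) pair, then aggregate.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(
        pool.map(lambda pair: process_exp_with_selection(*pair),
                 product(experiments, selections))
    )

assert len(results) == len(experiments) * len(selections)
```

Each result dict stands alone, so failed pairs can carry an error field without blocking the rest of the aggregation.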
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (2)
scripts/eval/lddt_evaluation_script.py (1)
223-241: Load each reference structure once per occupancy, then fan out by selection.

The current product loop reloads and reparses the same reference structure for every selection, which is avoidable overhead.
⚡ Proposed refactor
```diff
-    for protein_key, protein_config in protein_configs.items():
-        for occ, sel in itertools.product(args.occupancies, protein_config.selection):
-            ref_path, reference_proteins = get_reference_atomarraystack(protein_config, occ)
+    for protein_key, protein_config in protein_configs.items():
+        for occ in args.occupancies:
+            ref_path, reference_proteins = get_reference_atomarraystack(protein_config, occ)
             if reference_proteins is None:
                 logger.warning(
                     f"Could not find ref structure for {protein_key} and occupancy {occ}"
                 )
                 continue
+            if (protein_key, occ) not in reference_atom_arrays:
+                reference_atom_arrays[(protein_key, occ)] = {}
+
+            for sel in protein_config.selection:
+                try:
+                    reference_protein_stack, _, _ = map_altlocs_to_stack(
+                        reference_proteins,
+                        selection=translate_selection(sel),
+                        return_full_array=True,
+                    )
+                    reference_atom_arrays[(protein_key, occ)][sel] = reference_protein_stack
+                except Exception as e:
+                    logger.error(
+                        f"Error loading ref structure for {protein_key} and occupancy {occ}: {e}"
+                    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/lddt_evaluation_script.py` around lines 223-241: the loop currently calls get_reference_atomarraystack and reparses the same reference for every selection, because itertools.product over args.occupancies and protein_config.selection repeats loads. Change the logic to iterate over occupancies first (for occ in args.occupancies), call get_reference_atomarraystack(protein_config, occ) once, skip if None, then inside that occupancy iterate over selections (for sel in protein_config.selection), call map_altlocs_to_stack(reference_proteins, selection=translate_selection(sel), return_full_array=True), and store into reference_atom_arrays[(protein_key, occ)][sel]. Ensure the existing caching check for (protein_key, occ) remains so each reference is only stored once per occupancy.

scripts/eval/rscc_grid_search_script.py (1)
127-165: Avoid recomputing the refined density for every selection.

Parsing the CIF and recomputing density inside the selection loop scales linearly with the number of selections and can dominate runtime. Compute the predicted density once per experiment, then only run per-selection extraction/RSCC.
⚡ Refactor sketch
```diff
+        # Load refined structure and compute predicted density once per experiment
+        _structure = parse(_exp.refined_cif_path, ccd_mirror_path=None)
+        atom_array = get_asym_unit_from_structure(_structure)
+        if not hasattr(atom_array, "coord") or atom_array.coord is None:
+            raise AttributeError("AtomArray | AtomArrayStack is missing coordinates")
+        if not hasattr(atom_array, "b_factor"):
+            atom_array.set_annotation("b_factor", np.full(atom_array.coord.shape[-2], 20.0))
+
         for selection in protein_config.selection:
             ...
-            _structure = parse(_exp.refined_cif_path, ccd_mirror_path=None)
-            atom_array = get_asym_unit_from_structure(_structure)
-            ...
             _computed_density, _ = compute_density_from_atomarray(
                 atom_array, xmap=_base_xmap, em_mode=False, device=_device
             )
```

You can also split cache keys so full base maps are cached by `(protein, occ_a)` and extracted submaps by `(protein, occ_a, selection)`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/eval/rscc_grid_search_script.py` around lines 127 - 165, The code recomputes parsing and computed density inside the per-selection loop (parse, get_asym_unit_from_structure, compute_density_from_atomarray), causing O(selections) work; move CIF parsing and density computation out of the selection loop and cache the results per experiment/occ_a. Specifically: parse the refined CIF via parse(...) once for each _exp and produce atom_array/get_asym_unit_from_structure once, then call compute_density_from_atomarray(...) once to produce a cached _computed_density (keyed by (protein, _exp.occ_a) or a separate computed_density_cache). Inside the selection loop keep only loading/_base_xmap extraction and per-selection _extracted_base handling, and reuse the cached _computed_density instead of calling parse or compute_density_from_atomarray again.
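The cache-key split suggested above can be sketched as a two-level memo. The function and key names here are illustrative stand-ins, not the script's actual identifiers:

```python
# Hypothetical two-level cache: full base maps keyed by (protein, occ_a),
# extracted submaps keyed additionally by the selection string.
base_map_cache: dict = {}
submap_cache: dict = {}

def get_submap(protein, occ_a, selection, load_base, extract):
    base_key = (protein, occ_a)
    if base_key not in base_map_cache:
        # Expensive load happens once per (protein, occ_a), not per selection.
        base_map_cache[base_key] = load_base(protein, occ_a)
    sub_key = (protein, occ_a, selection)
    if sub_key not in submap_cache:
        submap_cache[sub_key] = extract(base_map_cache[base_key], selection)
    return submap_cache[sub_key]
```

With this shape, adding a new selection to a protein config costs one extraction, not a full reload and density recomputation.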
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/eval/rscc_grid_search_script.py`:
- Around line 180-193: The code mutates the shared Experiment object _exp across
selections causing state leakage; fix by working on a fresh shallow/deep copy
before any mutation — e.g., create a copy of _exp (use copy.deepcopy or convert
to dict first) and perform assignments to that copy (set rscc, base_map_path,
error) and then append the copy to results; ensure you stop mutating the
original _exp (referenced symbols: _exp, exp_dict_copy, selection, results,
rscc) so each selection row is independent.
In `@src/sampleworks/eval/eval_dataclasses.py`:
- Line 193: The CSV parsing currently uses
selection=row["selection"].strip().split(";") which can produce empty or
whitespace-only entries; update the parsing logic where the selection field is
built so you split on ";" then trim each item and filter out empty strings
(e.g., split, map strip, filter out falsy items) and then validate the resulting
list (e.g., ensure it's non-empty or that each item matches allowed choices)
before assigning to the selection field; adjust any error handling to raise a
clear exception or log invalid selections so downstream code that consumes
selection receives a normalized, validated list.
In `@src/sampleworks/eval/structure_utils.py`:
- Around line 369-373: The selection parsing in extract_selection_coordinates()
is inconsistent with apply_selection() and uses atom_array to compute the mask
(fragile for stacked inputs); update the condition to check for ("==", ">", "<",
"<=", ">=", " in ") and, in the else branch, compute mask from working_array
(use working_array.mask(selection) instead of atom_array.mask(selection)); keep
get_mask_from_old_selection_string(...) call for legacy parsing as-is and ensure
variable name mask is assigned from working_array when using the modern
selection path.
In `@src/sampleworks/utils/atom_array_utils.py`:
- Around line 295-303: After building altloc_ids and altloc_list via
find_all_altloc_ids and select_altloc, guard against cases where altloc_list has
fewer than two entries: if altloc_list is empty return an empty result (or the
function's appropriate zero-state), and if it contains exactly one element skip
calling filter_to_common_atoms and use that single selection as atom_arrays
(respecting return_full_array behavior); implement this check just before the
call to filter_to_common_atoms to avoid the downstream generic error.
- Around line 195-203: In find_all_altloc_ids(), hasattr(atom_array,
altloc_label) can be True even when the attribute value is None; change the
guard to fetch the attribute with getattr(atom_array, altloc_label, None),
validate that the returned altloc_ids is not None (and is the expected
array-like), and if it is None raise the same AttributeError (or a clear
ValueError) mentioning altloc_label and that the annotation is missing/None;
then proceed to call np.unique on altloc_ids and return set(altloc_ids.tolist())
- BLANK_ALTLOC_IDS.
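Two of the prompts above amount to small defensive guards. A hypothetical sketch of both (the BLANK_ALTLOC_IDS sentinel set and the annotation name are assumptions, not the library's actual values):

```python
import numpy as np

BLANK_ALTLOC_IDS = {"", ".", "?"}  # assumed blank-sentinel set

def parse_selection_field(raw: str) -> list[str]:
    """Split a ';'-delimited selection field, trimming whitespace and
    dropping empty entries; reject an effectively empty field."""
    items = [s.strip() for s in raw.split(";") if s.strip()]
    if not items:
        raise ValueError(f"selection field is empty or whitespace-only: {raw!r}")
    return items

def find_all_altloc_ids(atom_array, altloc_label: str = "altloc_id") -> set[str]:
    """Fetch the altloc annotation defensively: getattr(..., None) catches
    both a missing attribute and one that is present but None."""
    altloc_ids = getattr(atom_array, altloc_label, None)
    if altloc_ids is None:
        raise AttributeError(f"annotation {altloc_label!r} is missing or None")
    return set(np.unique(np.asarray(altloc_ids)).tolist()) - BLANK_ALTLOC_IDS
```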
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (8)
- scripts/eval/lddt_evaluation_script.py
- scripts/eval/rscc_grid_search_script.py
- src/sampleworks/core/rewards/real_space_density.py
- src/sampleworks/eval/eval_dataclasses.py
- src/sampleworks/eval/grid_search_eval_utils.py
- src/sampleworks/eval/structure_utils.py
- src/sampleworks/utils/atom_array_utils.py
- src/sampleworks/utils/density_utils.py
k-chrispens left a comment:
A few nits but won't hold up over them!
Quoted diff context:

```python
# this should be a temporary measure only until we switch to atomworks style in the RSCC script
```

```python
if any(x in selection for x in ("==", ">", "<", "<=", ">=", " in ")):
```
I think this is probably sufficient, but I am not sure. Maybe it would be better to detect the PyMOL-like selection syntax? We should also write (or use, if one exists) an atomworks/pandas <-> PyMOL style translator so that selections are easy to port directly into PyMOL scripts.
I'll make an issue to solve this. I have a translation method but it only handles some cases. We definitely need a more complete solution.
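For illustration, a partial translator covering just two PyMOL-style patterns might look like the sketch below; the column names (res_id, atom_name) and the output expression dialect are assumptions, and unrecognized selections pass through unchanged:

```python
import re

def pymol_to_query(selection: str) -> str:
    """Translate a couple of common PyMOL-style selections into
    pandas-query-style expressions. Handles only 'resi N-M' and
    'name X'; anything else is returned as-is."""
    sel = selection.strip()
    m = re.fullmatch(r"resi (\d+)-(\d+)", sel)
    if m:
        lo, hi = m.groups()
        return f"(res_id >= {lo}) & (res_id <= {hi})"
    m = re.fullmatch(r"name (\w+)", sel)
    if m:
        return f"atom_name == '{m.group(1)}'"
    return selection
```

A complete solution would need to cover boolean connectives, chain/segment selectors, and wildcards, which is presumably the scope of the issue mentioned above.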
Quoted diff context:

```python
def process_exp_with_selection(
    exp: Experiment,
```
I'm curious: how should we name these, given that we will be interacting with the diffuse hub schema? Should we align how we describe experiments for ourselves both here and there?
It's a good question. Personally I think the way to do this is to push all the metadata into the CIF files, and then we can do away with this kind of parsing.
Oh, but you're asking whether we should rename the "Experiment" class here? Yes, it would make sense to rename it.
Quoted diff context:

```python
    ----------
    exp: Experiment, a description of the structure generation experiment
    protein_config: ProteinConfig, specifying the locations of reference structures and maps
    px_seln_refernce_atom_array: AtomArrayStack,
```
Typo here that would be worth not propagating, I think: px_seln_refernce_atom_array -> px_seln_reference_atom_array
Quoted diff context:

```python
        f"({n_valid}/{n_total} atoms remaining)"
    )

    # TODO can we use [..., valid_mask] here?
```
I don't see why not, though I do think that coords is only going to be 2 dimensional here by requirement of AtomArrayStack or AtomArray
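For reference, the ellipsis form does handle both cases, provided the mask indexes the atom axis rather than the trailing coordinate axis, i.e. `[..., valid_mask, :]` rather than `[..., valid_mask]`:

```python
import numpy as np

valid_mask = np.array([True, False, True, True])

coords_2d = np.zeros((4, 3))     # AtomArray-like: (n_atoms, 3)
coords_3d = np.zeros((2, 4, 3))  # AtomArrayStack-like: (n_models, n_atoms, 3)

# The mask selects along the atom axis; the trailing ":" keeps x/y/z intact.
assert coords_2d[..., valid_mask, :].shape == (3, 3)
assert coords_3d[..., valid_mask, :].shape == (2, 3, 3)
```

So the same indexing expression works whether coords comes from a single model or a stack.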
…tom selection capabilities for eval scripts)
Sorry, this is a pretty big one. All it should do is what's in the title though.
Summary by CodeRabbit

- New Features
- Configuration Changes
- Improvements
- Utilities